Goto

Collaborating Authors

 adversarial attack


Frequency-Domain Regularized Adversarial Alignment for Transferable Attacks against Closed-Source MLLMs

arXiv.org Machine Learning

Multimodal large language models (MLLMs) remain vulnerable to transfer-based targeted attacks, where perturbations optimized on open-source surrogate encoders can generalize to closed-source MLLMs. A key challenge for improving adversarial transferability is to effectively capture the intrinsic visual focus shared across different models, such that perturbations align with transferable semantic cues rather than surrogate-specific behaviors. However, existing methods suffer from spatial-domain feature redundancy and surrogate-specific gradient signals, thereby hindering cross-model transferability. In this paper, we propose FRA-Attack, which addresses both challenges from a unified frequency-domain regularization perspective. For feature alignment, a high-pass DCT objective on patch features suppresses redundant global structures and concentrates the loss on the high-frequency band that carries the MLLMs' intrinsic visual focus. For gradient optimization, we introduce Frequency-domain Gradient Regularization (FGR), a \textit{model-agnostic} low-pass regularizer that modulates the surrogate gradient using only the geometric frequency coordinate, \textit{i.e.}, no surrogate-derived statistic is involved, so that FGR is model-agnostic by construction, removing surrogate-specific high-frequency artifacts while preserving transferable low-frequency directions. Together, the two components form a unified frequency-domain treatment of transferability. Extensive experiments on $15$ flagship MLLMs across $7$ vendors show that FRA-Attack achieves superior cross-model transferability, particularly with state-of-the-art performance on GPT-5.4, Claude-Opus-4.6 and Gemini-3-flash.



16d11e9595188dbad0418a85f0351aba-Supplemental.pdf

Neural Information Processing Systems

This section introduces more backgrounds on poisoning attacks and backdoor attacks, and details on the adversarial attacks that we use to craft accumulative poisoning samples in our methods. Finally, we describe the commonly used anomaly detection methods against adversarially crafted samples, following previous settings [40]. B.1 Poisoning attacks and backdoor attacks There is extensive prior work on poisoning attacks, especially in the offline settings against SVM [3], logistic regression [36], collaborative filtering [27], feature selection [54], clustering [8], and neural networks [9, 21, 22, 38, 50]. Poisoning attacks in real-time data streaming are studied on online SVM [4], autoregressive models [1, 7], bandit algorithms [20, 31, 33], and classification [26, 52, 57]. Compared to poisoning attacks, backdoor attacks draw attention in more recent researches.


Backpropagating Linearly Improves Transferability of Adversarial Examples (Supplementary Material)

Neural Information Processing Systems

Empirical results in Section 3.1 in the main paper show that simply removing ReLUs lead to improved transferability. In this section, we try freezing all learnable parameters in the unmodified sub-net h during fine-tuning and a similar observation about the initial improvement of transferability can still be decrease made and (see finally Figure the 5). Classification loss of these modified VGG-19 models on the benign CIFAR-10 test set is also reported, in Figure 6. On ImageNet, it is evaluated on the 50000official validation images. As mentioned in the main paper, many recent successes in improving adversarial transferability benefit from maximizing intermediate level distortions rather than the final prediction losses [8, 3, 2] of DNNs.


Backpropagating Linearly Improves Transferability of Adversarial Examples

Neural Information Processing Systems

The vulnerability of deep neural networks (DNNs) to adversarial examples has drawn great attention from the community. In this paper, we study the transferability of such examples, which lays the foundation of many black-box attacks on DNNs. We revisit a not so new but definitely noteworthy hypothesis of Goodfellow et al.'s and disclose that the transferability can be enhanced by improving the linearity of DNNs in an appropriate manner. We introduce linear backpropagation (LinBP), a method that performs backpropagation in a more linear fashion using off-the-shelf attacks that exploit gradients. More specifically, it calculates forward as normal but backpropagates loss as if some nonlinear activations are not encountered in the forward pass. Experimental results demonstrate that this simple yet effective method obviously outperforms current state-of-the-arts in crafting transferable adversarial examples on CIFAR-10 and ImageNet, leading to more effective attacks on a variety of DNNs.




Adv-Attribute: Inconspicuous and Transferable Adversarial Attack on Face Recognition

Neural Information Processing Systems

Deep learning models have shown their vulnerability when dealing with adversarial attacks. Existing attacks almost perform on low-level instances, such as pixels and super-pixels, and rarely exploit semantic clues. For face recognition attacks, existing methods typically generate the ℓp-norm perturbations on pixels, however, resulting in low attack transferability and high vulnerability to denoising defense models. In this work, instead of performing perturbations on the low-level pixels, we propose to generate attacks through perturbing on the high-level semantics to improve attack transferability. Specifically, a unified flexible framework, Adversarial Attributes (Adv-Attribute), is designed to generate inconspicuous and transferable attacks on face recognition, which crafts the adversarial noise and adds it into different attributes based on the guidance of the difference in face recognition features from the target. Moreover, the importance-aware attribute selection and the multi-objective optimization strategy are introduced to further ensure the balance of stealthiness and attacking strength. Extensive experiments on the FFHQ and CelebA-HQ datasets show that the proposed Adv-Attribute method achieves the state-of-the-art attacking success rates while maintaining better visual effects against recent attack methods.


fb4c48608ce8825b558ccf07169a3421-Supplemental.pdf

Neural Information Processing Systems

In this section, we perform additional diagnostics that give us confidence that our models are not doing any form of gradient obfuscation or masking [3, 53]. First, we report in Table 4 the robust accuracy obtained by our strongest models against a diverse set of attacks. The cascade is composed as follows: AUTOPGD-CE, an untargeted attack using PGD with an adaptive step on the cross-entropy loss [10], AUTOPGD-T, a targeted attack using PGD with an adaptive step on the difference of logits ratio [10], FAB-T, a targeted attack which minimizes the norm of adversarial perturbations [9], SQUARE, a query-efficient black-box attack [1]. First, we observe that our combination of attacks, denoted AA+MT matches the final robust accuracy measured by AUTOATTACK. Second, we also notice that the black-box attack (i.e., SQUARE) does not find any additional adversarial examples.